Abstract



Nonparametric, or kernel, estimation of the item response curve (IRC) is of both theoretical
and operational concern. This estimation, often used in item analysis in testing programs,
is biased when the observed scores are used as the regressor because the observed scores are
contaminated by measurement error. In this study, we investigate the deconvolution kernel 
estimation of IRC, correcting for the measurement error in the regressor variable. Using item 
response theory (IRT) simulated data and some real data, we compared the traditional kernel 
estimation and the deconvolution estimation of IRC. Results show that in capturing important 
features of the IRC, the traditional kernel estimation is comparable to the deconvolution kernel 
estimation in item analysis. 
1 Overview



Nonparametric item response theory (NIRT) uses nonparametric regression techniques 
extensively. (See Douglas & Cohen, 2001; Lee, 2007; Meijer, 2004; and Sijtsma, 1998.)
Characterized by a nonparametric function of the latent trait, NIRT differs from parametric 
item response theory (IRT), which is used in Rasch and the two-parameter and three-parameter 
logistic (2PL and 3PL) models. Nonparametric estimation has been the focus of many studies.
Ramsay (1991) and Wand and Jones (1995) described a nonparametric regression method 
that can estimate item response curves (IRC). This method, based on kernel smoothing (e.g., 
Silverman, 1986), is implemented in TESTGRAF (Ramsay, 1998). Livingston and Dorans (2004) 
discussed the use of Ramsay’s (1991) method to estimate the response curves for each answer 
option for a multiple choice item. Nonparametric estimation is often used in classical test theory 
(CTT), and recently Lee (2007) compared the kernel smoothing method and other regression 
methods with monotonicity constraints to estimate item characteristic curves (ICC). We focus 
only on the nonparametric kernel smoothing methods because the monotonicity constraint on the
ICC estimation will not help to identify problematic items. Well-written items should reveal
a monotonically increasing IRC for the key, as shown in the left panel of Figure 1. The right 
panel of Figure 1 shows a decreasing IRC for the key, which indicates that the item may be 
problematic. An increasing IRC for the top scores of the nonkey also indicates a problematic 
item. Psychometricians, test developers, and clients find plots similar to those shown in Figure 1 
helpful because they are easily interpreted. In the kernel smoothing method (Ramsay, 1998), the 
response variable is the item score (0 or 1 for the dichotomous items) or the proportion of right 
answers among examinees; the regressor (or independent variable) is the ability or the total true 
score of the examinee. In practice, however, neither the ability nor the true score is available. 

In plotting the IRC, the observed score or scaled score is used, especially for testing programs 
that use observed scores. (See Figures 1 to 8 in Livingston & Dorans, 2004.) The accuracy of the 
estimated IRC is a concern because the observed scores are contaminated by measurement error. 
Nonparametric regression in the presence of measurement error has been studied intensively in 
the area of statistics. Carroll, Maca, and Ruppert (1999) showed that the simple/naive/traditional
nonparametric regression estimate was inconsistent. Fan and Truong (1993) proposed a
deconvolution kernel regression method and, under different measurement error distributions,
obtained asymptotic results. Delaigle, Fan, and Carroll (2009) extended the deconvolution method
to local polynomial regression. These methods produce asymptotically unbiased estimators; the
trade-off is that the convergence rate is discouraging. Wand (1998), however, indicated in a 
detailed analysis that the deconvolution method can perform well for lower levels of measurement 
error in reasonable sample sizes. (See also Carroll, Ruppert, Stefanski, & Crainiceanu, 2006.) 
With this information, it is important to examine if correction for measurement error leads 
to improved nonparametric estimates of IRCs. Would nonparametric regression estimation of 
IRC with correction for measurement error result in a significant improvement in identifying 
problematic items? In this study, we discuss the kernel estimation method (Ramsay, 1991) and 
then introduce the deconvolution estimation method (Fan & Truong, 1993) in section 2. Section
3 discusses applications of the deconvolution kernel regression method. Naive kernel regression 
is one of the commonly used nonparametric methods in practice (Livingston & Dorans, 2004;
Ramsay, 1998), so we provide a comparison between the naive kernel regression and deconvolution 
kernel regression of the IRC function by using simulated data and operational data. In section 4, 
we discuss estimating the IRC function in practice. The distributions of measurement error in both
CTT and IRT models are addressed in the appendix.





Figure 1. Small item response curves (IRCs) of two hypothetical items. 



2 Nonparametric Regression With Measurement Error 

Suppose $(X_1, Y_1), \ldots, (X_n, Y_n)$ are independently and identically distributed (i.i.d.)
random samples from $(X, Y)$. We are interested in estimating the smooth regression curve
$m(x) = E(Y \mid X = x)$. In the context of educational testing, $X_i$ and $Y_i$ could denote, for examinee
$i$, the true total score and item score, respectively. When $X$ is observable, at each point $x$, the
(naive) kernel smoothing estimator (Ramsay, 1991; Wand & Jones, 1995) is the weighted average
of $Y_i$:

$$\hat{m}(x) = \sum_{i=1}^{n} K\left(\frac{x - X_i}{h}\right) Y_i \Big/ \sum_{i=1}^{n} K\left(\frac{x - X_i}{h}\right), \qquad (1)$$

where $K(\cdot)$ is the kernel function, and $h$ is the bandwidth.
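
As an illustration of estimator (1), the following is a minimal sketch that assumes a standard normal density as the kernel $K$. The function and argument names (`naive_kernel_irc`, `scores`, `item_responses`) are hypothetical and are not taken from TESTGRAF or any other software mentioned in this report.

```python
import numpy as np


def gaussian_kernel(u):
    """Standard normal density, assumed here as the kernel K."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)


def naive_kernel_irc(x_grid, scores, item_responses, h):
    """Evaluate the weighted average in (1) at each point of x_grid.

    scores         : regressor values X_i (true or observed total scores)
    item_responses : item scores Y_i (0 or 1 for a dichotomous item)
    h              : bandwidth
    """
    x_grid = np.asarray(x_grid, dtype=float)
    scores = np.asarray(scores, dtype=float)
    y = np.asarray(item_responses, dtype=float)
    # weights[g, i] = K((x_g - X_i) / h)
    weights = gaussian_kernel((x_grid[:, None] - scores[None, :]) / h)
    return weights @ y / weights.sum(axis=1)
```

With two examinees scoring 0 and 2 who answer the item wrong and right, respectively, the estimate at the midpoint $x = 1$ is 0.5, because both observations receive the same weight.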

However, sometimes $X$ is not observable. Instead, $(Z_1, Y_1), \ldots, (Z_n, Y_n)$ are observed, where
$Z = X + e$, $e$ is the measurement error and is independent of $(X, Y)$. For example, $Z_i$ could be the
observed total score instead of the true total score. The deconvolution method (Fan & Truong,
1993), which can be used to provide a statistically consistent estimator of $m$ based on $(Z, Y)$, is
described next.

Notice that $Z = X + e$, and $X$ and $e$ are independent. Then the probability density function
$f_Z(\cdot)$ of $Z$ is a convolution of the two density functions $f_X(\cdot)$ and $f_e(\cdot)$. That is,

$$f_Z(z) = \int f_X(z - x) f_e(x)\, dx, \quad \text{i.e.,} \quad f_Z = f_X * f_e.$$

Using the Fourier transformation property (e.g., Stein & Weiss, 1971), one has

$$\chi_Z = \chi_X \cdot \chi_e,$$

where $\chi_X$ is the Fourier transformation of the density function of $X$. For example,

$$\chi_X(t) = \int \exp(-2\pi i x t) f_X(x)\, dx,$$

where $i = \sqrt{-1}$. Now the convolution problem is simplified to a multiplication problem: because
the Fourier transformation of a convolution is the product of the individual transformations,
$\chi_X$ equals $\chi_Z / \chi_e$. The density function $f_X$ can thus be obtained by an inverse Fourier
transform. This is the idea of the deconvolution method. The deconvolution kernel estimator of $m(\cdot)$ is



$$\hat{m}(x) = \sum_{i=1}^{n} K^*\left(\frac{x - Z_i}{h}\right) Y_i \Big/ \sum_{i=1}^{n} K^*\left(\frac{x - Z_i}{h}\right), \qquad (2)$$

where

$$K^*(x) = \frac{1}{2\pi} \int \exp(-itx)\, \frac{\phi_K(t)}{\phi_e(t/h)}\, dt, \qquad (3)$$

and where $\phi_K$ is the Fourier transform of the kernel function $K(\cdot)$, and $\phi_e$ is the characteristic
function of $e$:

$$\phi_e(t) = \int \exp(itx) f_e(x)\, dx.$$

Thus, the distribution $f_e(x)$ of the measurement error should be known in order to use the
deconvolution method. See the appendix for some results on distributions of measurement error.
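
To make the construction concrete, the sketch below evaluates a deconvolution kernel numerically and plugs it into the ratio estimator (2). It rests on several assumptions that are not taken from this report: the deconvolution kernel is taken to have the standard form $K^*(x) = \frac{1}{2\pi}\int \exp(-itx)\,\phi_K(t)/\phi_e(t/h)\,dt$, the error is normal with standard deviation `sigma_e` (so that $\phi_e(t) = \exp(-\sigma_e^2 t^2/2)$), and the kernel is chosen, purely for illustration, to have the compactly supported transform $\phi_K(t) = (1 - t^2)^3$ on $[-1, 1]$. The kernel actually used in the study may differ, and all function names are hypothetical.

```python
import numpy as np


def deconvolution_kernel(x, h, sigma_e, n_nodes=401):
    """Numerically evaluate K*(x) for a N(0, sigma_e^2) error.

    Assumed for illustration: phi_K(t) = (1 - t^2)^3 on [-1, 1].
    """
    x = np.atleast_1d(np.asarray(x, dtype=float))
    t = np.linspace(-1.0, 1.0, n_nodes)
    phi_k = (1.0 - t**2) ** 3
    # 1 / phi_e(t / h) for a normal error with standard deviation sigma_e
    inv_phi_e = np.exp(0.5 * (sigma_e * t / h) ** 2)
    # The integrand is even in t, so exp(-itx) reduces to cos(tx).
    integrand = np.cos(np.outer(x, t)) * phi_k * inv_phi_e
    # Trapezoidal rule over t, written out explicitly.
    dt = t[1] - t[0]
    integral = dt * (integrand[:, 1:-1].sum(axis=1)
                     + 0.5 * (integrand[:, 0] + integrand[:, -1]))
    return integral / (2.0 * np.pi)


def deconvolution_irc(x_grid, observed_scores, item_responses, h, sigma_e):
    """Ratio estimator (2), using observed scores Z_i and the kernel K*."""
    x_grid = np.asarray(x_grid, dtype=float)
    z = np.asarray(observed_scores, dtype=float)
    y = np.asarray(item_responses, dtype=float)
    u = (x_grid[:, None] - z[None, :]) / h
    weights = deconvolution_kernel(u.ravel(), h, sigma_e).reshape(u.shape)
    return weights @ y / weights.sum(axis=1)
```

Setting `sigma_e=0` recovers the ordinary kernel whose transform is $\phi_K$, which gives a simple sanity check. Because $K^*$ can take negative values, the denominator in (2) may be close to zero where data are sparse, so estimates far from the observed scores should be read with caution. The sketch builds a full array of size (grid points × examinees × nodes) and is meant for small illustrations only.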



The deconvolution estimation produces an asymptotically unbiased estimator, but the
convergence rate of the deconvolution estimator is slower than that of the naive kernel smoothing
estimator (Fan & Truong, 1993).

Fan and Truong (1993) showed that the deconvolution estimator is robust to different choices
of kernel functions. Among these kernels, one has the following simple form:

which will be used in the IRC estimation later. This $K^*$ is different from a regular kernel $K$.
Figure 2 displays how different $K^*$ is from $K$ for different standard errors of measurement (SEMs)
and bandwidths when $K$ is a normal density $\psi$.

From Figure 2, we can observe that when the standard deviation of error $\sigma_e$ (denoted as SEM
in the plots) is very small, $K^*$ and $\psi$ are hardly distinguishable for a wide range of bandwidths $h$.
As $\sigma_e$ increases, $K^*$ deviates more from $\psi$. But $h$ has the opposite effect on $K^*$: as $h$ decreases,
$K^*$ deviates more from $\psi$.



3 Applications 

3.1 Simulated Data 

We use the 2PL IRT model to simulate data in order to have a true item response
function (IRF) to compare with the nonparametric estimates of the IRC. For a 2PL model, the IRF
for item $j$ is given by (A6). Given a test form with $(\alpha_j, \beta_j)$, $j = 1, \ldots, J$, the true score $X$ given $\theta$
is

$$X(\theta) = \sum_{j=1}^{J} P(\theta; j) = \sum_{j=1}^{J} \frac{e^{\alpha_j\theta - \beta_j}}{1 + e^{\alpha_j\theta - \beta_j}},$$

which is a monotonic function of $\theta$. Thus there exists a one-to-one relationship between the true
score $X(\theta)$ and the ability parameter $\theta$. The plot of $P(\theta; j)$ against $X(\theta)$ is our criterion in the
comparison of different ways of nonparametric IRC estimation of item $j$.
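
The criterion curve can be sketched as follows, assuming the 2PL parametrization $P(\theta; j) = e^{\alpha_j\theta - \beta_j}/(1 + e^{\alpha_j\theta - \beta_j})$ of (A6). The function names are hypothetical. Note that the true score is strictly increasing in $\theta$ only when every discrimination $\alpha_j$ is positive.

```python
import numpy as np


def irf_2pl(theta, alpha, beta):
    """2PL item response function; returns an array of shape (n_theta, n_items)."""
    theta = np.atleast_1d(np.asarray(theta, dtype=float))
    alpha = np.asarray(alpha, dtype=float)
    beta = np.asarray(beta, dtype=float)
    logit = theta[:, None] * alpha[None, :] - beta[None, :]
    return 1.0 / (1.0 + np.exp(-logit))


def true_score(theta, alpha, beta):
    """True score X(theta): the sum of the item response functions."""
    return irf_2pl(theta, alpha, beta).sum(axis=1)
```

Plotting column `j` of `irf_2pl(theta_grid, alpha, beta)` against `true_score(theta_grid, alpha, beta)` over a grid of abilities gives the true IRC of item `j` on the true-score scale.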

Figure 2. A comparison of the deconvolution kernel $K^*(x)$ (solid line) and the normal
density $\psi(x)$ (dashed line) for different bandwidths $h$ and standard deviations $\sigma_e$ of error.
In each plot, the x-axis indicates the independent variable $x$, and the y-axis indicates
the dependent variable $K^*(x)$ or $\psi(x)$. Notice that the y-axis scales are different in the
plots.

In the simulation, test lengths are 20, 40, and 80 for short, medium, and long tests. Sample
sizes are 100, 500, 1,000, and 5,000 for small, medium, large, and very large samples. The ability
follows a normal distribution $N(0, 1)$; the item difficulty follows the same distribution as the
ability distribution; the item discrimination follows a normal distribution $N(1, .25)$.
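
A simulation of this kind might be set up as in the sketch below. Two points are assumptions rather than facts from this report: the second argument of $N(1, .25)$ is read as a variance (standard deviation 0.5), and the random seed and all names are hypothetical.

```python
import numpy as np


def simulate_2pl(n_examinees, n_items, ability_sd=1.0, seed=0):
    """Simulate dichotomous responses under the 2PL model.

    Assumption: N(1, .25) is read as mean 1 and variance .25 (sd 0.5).
    """
    rng = np.random.default_rng(seed)
    theta = rng.normal(0.0, ability_sd, size=n_examinees)  # abilities
    beta = rng.normal(0.0, ability_sd, size=n_items)        # difficulties
    alpha = rng.normal(1.0, 0.5, size=n_items)              # discriminations
    prob = 1.0 / (1.0 + np.exp(-(np.outer(theta, alpha) - beta)))
    responses = (rng.random(prob.shape) < prob).astype(int)
    return {
        "theta": theta,
        "alpha": alpha,
        "beta": beta,
        "responses": responses,
        "true_score": prob.sum(axis=1),          # X_i
        "observed_score": responses.sum(axis=1),  # Z_i
    }
```

For example, `simulate_2pl(500, 40)` produces one data set for the medium test length and medium sample size condition.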

Calculation of $\hat{m}(x)$ requires specification of a bandwidth $h$. For the naive kernel smoothing,
a popular approach is to make an asymptotic expansion of the mean integrated squared error
(MISE),

$$\mathrm{MISE} = E\left( \int [\hat{m}(x) - m(x)]^2 f(x)\, dx \right),$$

where $f(x)$ is the density function of $X$. The optimal bandwidth is the one that minimizes MISE
(Ruppert, Sheather, & Wand, 1995):

$$h_{\mathrm{MISE}} = \left[ \frac{R(K)}{\mu_2(K)^2 \int m^{(2)}(x)^2 f(x)\, dx} \right]^{1/5} n^{-1/5}, \qquad (4)$$

where $R(K) = \int K^2(x)\, dx$, $\mu_2(K) = \int x^2 K(x)\, dx$, and $m^{(2)}(x)$ is the second derivative function
of $m(x)$. Replacing the unknown integrals by estimators gives the plug-in bandwidth. In the
following estimation, the bandwidth for the naive kernel smoothing is $h_{\mathrm{MISE}}$. Notice that this
$h_{\mathrm{MISE}}$ helps to produce a smooth regression curve.

Let the root sum squared error (RSE) of the nonparametric estimate of the IRC be

$$\mathrm{RSE} = \sqrt{\sum_{i=1}^{n} (\hat{m}(x_i) - m(x_i))^2},$$

where $\hat{m}(x)$ is the estimated function and $m(x)$ is the true function. Since there is no available
optimal bandwidth formula for the deconvolution estimation, we experimented with different
bandwidths and found that the deconvolution estimation with half of $h_{\mathrm{MISE}}$ produces an IRC
with an RSE at or near the minimum. Therefore, in the following estimation, the bandwidth
for the deconvolution method is chosen to be $h_{\mathrm{MISE}}/2$.
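
The RSE is a direct computation once the estimated and true curves are evaluated at the same points. A minimal sketch, with a hypothetical function name:

```python
import numpy as np


def root_sum_squared_error(estimated_irc, true_irc):
    """RSE: square root of the summed squared differences between two curves."""
    diff = np.asarray(estimated_irc, dtype=float) - np.asarray(true_irc, dtype=float)
    return float(np.sqrt(np.sum(diff**2)))
```

Because the RSE is a sum rather than a mean, its size grows with the number of evaluation points, so values are comparable only across conditions evaluated at the same number of points.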

Table 1 compares the RSE of the naive estimator (RSE.n) and the RSE of the deconvolution
estimator (RSE.d) of the IRC for one item under different sample sizes, test lengths, and ability
variances. Notice that the bandwidth is based on (4), which minimizes the MISE for the naive
kernel estimation, not the deconvolution estimation. The RSE.d could have been slightly improved
had we adjusted the bandwidth for individual items. However, the improvement was found to be
negligible in our analysis.

Figures 3 to 5 compare the deconvolution estimator and the naive kernel estimator for
three simulated data sets. The measurement error is assumed to have a normal distribution
$N(0, \sigma_e)$, where $\sigma_e$ is estimated by SEM.th in (A7) in the appendix. It can be observed that
the deconvolution estimators and the naive estimators behave similarly with respect to the
criterion for all concerned test lengths, numbers of examinees, and standard deviations of ability
distributions. Figure 6 displays the IRC estimates with different ability standard deviations.

Table 1

Comparison of RSE.n and RSE.d for One Item

                      $\sigma_\theta$ = .1          $\sigma_\theta$ = 1           $\sigma_\theta$ = 5
                      J=20  J=40  J=80      J=20  J=40  J=80      J=20  J=40  J=80
n = 100      RSE.n    0.36  0.53  0.75      1.19  1.46  0.72      0.65  1.24  0.42
             RSE.d    0.39  0.72  0.87      1.02  1.38  0.66      0.59  1.25  0.39
n = 500      RSE.n    0.54  0.40  0.27      0.43  1.60  1.06      1.38  1.38  0.77
             RSE.d    0.79  0.38  1.31      0.47  1.42  1.01      1.23  1.27  0.81
n = 1,000    RSE.n    0.39  0.53  0.91      1.50  1.35  1.28      1.16  0.49  0.92
             RSE.d    0.30  1.20  1.80      0.91  1.05  1.11      1.75  1.07  0.88
n = 5,000    RSE.n    1.18  0.23  0.64      2.12  2.25  4.03      1.32  1.31  1.14
             RSE.d    1.08  0.80  0.40      0.91  1.70  4.00      2.56  0.93  1.10

Note. RSE.d = root sum squared error (RSE) of the deconvolution estimator, RSE.n = RSE of
the naive estimator; J = test length; n = sample size.

Figure 3. Item response curve (IRC) plots of several items in a test of length 20 and
a sample size of 100. In each plot, the x-axis indicates the true score, and the y-axis is the
probability of answering the item right for a true score of x.

Figure 4. Item response curve (IRC) plots of several items in a test of length 40 and
a sample size of 500. In each plot, the x-axis indicates the true score, and the y-axis
is the probability of answering the item right for a true score of x.

Figure 5. Item response curve (IRC) plots of several items in a test of length 80 and
a sample size of 1,000. In each plot, the x-axis indicates the true score, and the y-axis
is the probability of answering the item right for a true score of x.

Figure 6. Item response curve (IRC) plots of several items in a test of length 80 and a
sample size of 1,000. In each plot, the x-axis indicates the true score, and the y-axis is
the probability of answering the item right for a true score of x. The ability standard
deviation varies among 0.1, 1, 3, and 5, as depicted at the bottom right corner of
each plot.

3.2 Real Data Examples

We now compare the two nonparametric IRC estimators using a data set from an operational
test. The test has 147 multiple-choice items. Its SEM of 6.9, which is calculated from (A2) in
the appendix, is used as $\sigma_e$ in the calculation of the deconvolution estimator. Also, the error
is assumed to follow a normal distribution. Figure 7 displays the IRC plots of 12 items on the
test. For comparison, we also included Ramsay’s estimator (Ramsay, 1991) in the plots. All the
estimators of IRC capture the characteristics (shape, monotonicity, etc.) of the IRC in the same
way. Note that each panel in Figure 7 shows the nonparametric estimate corresponding to only the
key of multiple-choice items. It is possible to compare plots like those in Figure 1 by computing
the nonparametric estimates of the nonkey answer options using both the deconvolution and
naive kernel estimation. Such comparisons (results not shown) also showed virtually no difference
between the two methods.

Figure 7. Item response curve (IRC) estimations for some items in a real test. The
standard error of measurement for Ramsay’s estimator (SEM.r) = 6.9. In each plot,
the x-axis indicates the true score, and the y-axis is the probability of answering the
item right for a score of x.



4 Discussion 

We investigated the deconvolution method under a variety of conditions to correct the
influence of measurement error. The naive method and the deconvolution method produce similar
results in our study. The similarity may be attributed to the relatively small SEM ($\sigma_e < 10$)
and the relatively large bandwidth ($h > 3$). In this case, the naive kernel function $K(\cdot)$ and the
modified kernel function $K^*(\cdot)$ in (3) are close, and thus the two methods yield similar estimates.

When a study’s main focus is to investigate an item’s IRC property (i.e., whether a test
taker’s chance of obtaining the right answer increases with his or her ability), the naive kernel
estimation is competitive with other statistical methods with error correction. The deconvolution
method provides an asymptotically unbiased estimator; however, the difficulty lies in the unknown
distributions of measurement error and the unavailability of optimal bandwidth choices in practice.

Assuming that the variance of measurement error is constant is another limitation of the
deconvolution method. The measurement error has a heterogeneous distribution in many
IRT models. It is worth investigating whether significant improvement can be expected
by using regression models with correction for heterogeneous measurement error.
Appendix

Measurement Error 



To apply the deconvolution method, one needs to know the distribution of the measurement 
error. Here, we discuss asymptotic results of error distribution in both CTT and IRT models. 



Measurement Error in Classical Test Theory (CTT) 

Let n be the number of examinees, and J be the number of items on the test. In CTT, the 
observed score Z and the true score X have the following relationship: 

Z = X + e 



where e is the measurement error independent of X. 

Assumption 1. $Z = \sum_{j=1}^{J} Z_j$, and $X = \sum_{j=1}^{J} X_j$, where $Z_j$ and $X_j$ are the item score and true
item score, respectively.

Assumption 2. $Z_j - X_j$ are independent of $Z_i - X_i$ for $j \neq i$.

Notice that $\{Z_j - X_j, j = 1, \ldots, J\}$ are bounded and independent random variables by
Assumptions 1 and 2. Then, by the Lindeberg-Feller theorem (Durrett, 1995), for $J \to \infty$ (that
is, for a very long test),

$$\frac{Z - X}{\sigma_e} = \frac{\sum_{j=1}^{J} (Z_j - X_j)}{\sigma_e} \to N(0, 1). \qquad (A1)$$

Assumptions 1 and 2 are reasonable conditions. Assumption 1 says that the test score is a sum of 
item scores while Assumption 2 says that the measurement errors are independent of each other. 

Since $X$ is not observable, $\sigma_e$ cannot be calculated directly. However, the test reliability $\gamma$
can be estimated in many ways (Haertel, 2006), and then $\sigma_e$ can be estimated as

$$\hat{\sigma}_e = \sigma_Z \sqrt{1 - \gamma}. \qquad (A2)$$
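
As a sketch of how $\sigma_e = \sigma_Z\sqrt{1 - \gamma}$ might be computed from an examinee-by-item response matrix, the code below estimates the reliability with Cronbach's alpha, one of several available reliability estimates. The function names are hypothetical, and the use of sample variances with `ddof=1` is an assumption.

```python
import numpy as np


def cronbach_alpha(responses):
    """Cronbach's alpha for an (n_examinees, n_items) score matrix."""
    responses = np.asarray(responses, dtype=float)
    n_items = responses.shape[1]
    item_var_sum = responses.var(axis=0, ddof=1).sum()
    total_var = responses.sum(axis=1).var(ddof=1)
    return n_items / (n_items - 1) * (1.0 - item_var_sum / total_var)


def sem_from_reliability(responses):
    """SEM as sd(total score) * sqrt(1 - reliability), following (A2)."""
    responses = np.asarray(responses, dtype=float)
    total_sd = responses.sum(axis=1).std(ddof=1)
    # Guard against a tiny negative value caused by rounding when alpha is 1.
    return total_sd * np.sqrt(max(0.0, 1.0 - cronbach_alpha(responses)))
```

When the estimated reliability is 0, the SEM equals the standard deviation of the observed total score; when it is 1, the SEM is 0.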



Measurement Error in Item Response Theory (IRT) 

An important difference between CTT and IRT is the treatment of measurement error. CTT 
assumes that the variance of error is the same for each examinee, but IRT allows it to vary. The 
unidimensionality, local independence, and monotonic increment of the item response function 
(IRF) are assumed here conventionally. The observed score $Z$ and the true score $X$ have the
following relationship for given ability $\theta$:

$$Z(\theta) = \sum_{j=1}^{J} Z_j(\theta), \qquad X(\theta) = E(Z(\theta)) = \sum_{j=1}^{J} p_j(\theta),$$

where $Z_j(\theta)$ is the dichotomous item score of the examinee on item $j$, $E(\cdot)$ is the expectation
operator, and $p_j(\theta)$ is the probability of obtaining the right answer on item $j$ for an examinee with
ability $\theta$. Let $G(\theta)$ be the distribution function of $\theta$ with mean $\mu_\theta$ and standard deviation $\sigma_\theta$.
Denote the conditional measurement error variance given $\theta$ as

$$\sigma_e^2(\theta) = E[(Z - X)^2 \mid \theta] = E[(Z - E(Z \mid \theta))^2 \mid \theta] = \mathrm{Var}(Z \mid \theta),$$

and the unconditional variance of error can be expressed as (Kolen, Zeng, & Hanson, 1996)

$$\sigma_e^2 = \int_{\Theta} \sigma_e^2(\theta)\, dG(\theta) = \sum_{j=1}^{J} \int_{\Theta} p_j(\theta)(1 - p_j(\theta))\, dG(\theta). \qquad (A3)$$



For a long test with fixed item parameters, that is, when $J \to \infty$, the standardized score
given $\theta$ is given by

$$\frac{Z(\theta) - X(\theta)}{\sigma_e(\theta)} = \sum_{j=1}^{J} \frac{Z_j(\theta) - p_j(\theta)}{\sigma_e(\theta)} \to N(0, 1), \qquad (A4)$$

by the Lindeberg-Feller theorem again. Note that the variance of error is a function of $\theta$, which
is different from (A1) in the CTT model. Also note that the right-hand side of (A4) is a random
variable independent of $\theta$.

Assumption 3. Suppose that

$$\mathrm{Var}(\sigma_e(\theta)) / \sigma_e^2 \to 0.$$

Then, under Assumption 3, (A4) can be rewritten as

$$\frac{Z(\theta) - X(\theta)}{\sigma_e} \to N(0, 1). \qquad (A5)$$

Assumption 3 requires that the ratio of variation of $\sigma_e^2(\theta)$ be very small as the ability variable $\theta$
varies. Under the CTT model, as $\sigma_e^2(\theta)$ is a constant for all examinees, $\mathrm{Var}(\sigma_e(\theta)) = 0$; hence this
assumption is true. When the ability of the examinees is not too heterogeneous, that is, when
$\mathrm{Var}(\sigma_e(\theta))$ is very small compared to $\sigma_e^2$, Assumption 3 is likely to be true too.
Numerical Results

We use the 2PL IRT model to simulate data to investigate the error distributions and SEM.
The 2PL model assumes that

$$P(Z_j = 1 \mid \theta) = \frac{e^{\alpha_j\theta - \beta_j}}{1 + e^{\alpha_j\theta - \beta_j}}. \qquad (A6)$$

In the simulation, test lengths are 20, 40, and 80 for short, medium, and long tests. Sample sizes
are 100, 500, 1,000, and 5,000 for small, medium, large, and very large samples. The distributions
of ability, $\beta$ parameter, and $\alpha$ parameter follow $N(0, \sigma_\theta)$, $N(0, \sigma_\theta)$, and $N(1, 0.25)$, respectively.
For 2PL IRT models, (A3) becomes

$$\sigma_e^2 = \sum_{j=1}^{J} \int \frac{e^{\alpha_j\theta - \beta_j}}{(1 + e^{\alpha_j\theta - \beta_j})^2}\, \psi(\theta)\, d\theta, \qquad (A7)$$

where $\psi(\theta)$ is the normal density function.
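
The integral in (A7) can be approximated with Gauss-Hermite quadrature. The sketch below assumes that the ability density is $N(0, \sigma_\theta^2)$ with standard deviation `ability_sd` and uses the identity $e^{x}/(1 + e^{x})^2 = p(1 - p)$ for the 2PL probability $p$. The function name and the number of quadrature nodes are hypothetical choices.

```python
import numpy as np
from numpy.polynomial.hermite import hermgauss


def sem_theoretical(alpha, beta, ability_sd=1.0, n_nodes=41):
    """Approximate the square root of (A7) by Gauss-Hermite quadrature."""
    alpha = np.asarray(alpha, dtype=float)
    beta = np.asarray(beta, dtype=float)
    nodes, weights = hermgauss(n_nodes)
    # Change of variable: theta = sqrt(2) * sd * node maps the Hermite weight
    # exp(-x^2) onto the N(0, sd^2) density, up to a factor 1 / sqrt(pi).
    theta = np.sqrt(2.0) * ability_sd * nodes
    p = 1.0 / (1.0 + np.exp(-(np.outer(theta, alpha) - beta)))
    conditional_var = (p * (1.0 - p)).sum(axis=1)  # sigma_e^2(theta)
    return float(np.sqrt(weights @ conditional_var / np.sqrt(np.pi)))
```

Doubling a test by repeating every item doubles the error variance, so the SEM grows by a factor of $\sqrt{2}$, in line with the increase in SEM with test length reported in this appendix.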

Three SEMs are calculated for each simulated data set: SEM.th is from the theoretical
formula (A7) by using Gaussian quadrature approximation of integration; SEM.r is obtained
from (A2), where the reliability is estimated using the Cronbach’s alpha method (e.g.,
Haertel, 2006); and SEM.em is the empirical SEM using data and true scores, that is,
$\mathrm{SEM.em} = \sqrt{\sum_{i=1}^{n} (Z_i - X_i)^2 / (n - 1)}$, where $Z_i$ and $X_i$ are the observed score and true score for
examinee $i$, and $n$ is the number of examinees. The three SEMs are compared in Table A1. None
of the three SEMs is affected by the sample size, but all are affected by the test length and the
variability of the ability. As expected, as tests become longer, SEM increases; when the sample
ability is more heterogeneous, SEM is slightly smaller (because the test reliability is larger for a
more heterogeneous population). Overall, the SEMs are relatively comparable across the three
different methods of calculation.

Table A1

Comparison of the Standard Error of Measurement (SEM) for Tests of
Length 20, 40, and 80

                              SEM.r                 SEM.th                SEM.em
Test length   n \ $\sigma_\theta$     .1     1     5        .1     1     5        .1     1     5
20              100     2.21  1.92  1.38      2.23  1.92  1.02      2.36  1.97  1.17
                500     2.24  1.94  1.39      2.23  1.97  1.08      2.22  1.86  1.16
              1,000     2.23  1.92  1.40      2.23  1.92  1.12      2.14  1.99  1.05
              5,000     2.23  1.85  1.38      2.23  1.91  1.06      2.26  1.94  1.09
40              100     3.16  2.67  1.95      3.15  2.70  1.43      2.73  2.71  1.42
                500     3.16  2.65  1.98      3.15  2.70  1.51      3.23  2.76  1.49
              1,000     3.15  2.75  1.97      3.16  2.67  1.47      3.22  2.63  1.45
              5,000     3.15  2.73  1.96      3.15  2.71  1.38      3.11  2.63  1.56
80              100     4.46  3.87  2.77      4.46  3.81  2.17      4.45  3.83  2.17
                500     4.46  3.85  2.77      4.46  3.81  2.13      4.53  3.44  2.19
              1,000     4.46  3.80  2.77      4.46  3.80  2.01      4.57  3.94  2.38
              5,000     4.46  3.82  2.75      4.46  3.90  2.03      4.51  3.88  2.12

Note. SEM.em = empirical SEM; SEM.r = SEM obtained from (A2), where the reliability is
estimated using the Cronbach’s alpha method; SEM.th = SEM obtained from theoretical formula
(A7) by using Gaussian quadrature approximation of integration.

The variance and mean square ratios of $\sigma_e(\theta)$, as in Assumption 3, are displayed in Table
A2. The ratio increases with the size of the ability variation. Verification of normality
of the error distribution is displayed in QQ plots in Figures A1 to A3. The points in the QQ
plot are formed by pairs of estimated quantiles from the data $(e_1, \ldots, e_n)$ and estimated quantiles
from $n$ observations of a normal distribution $N(0, \sigma_e)$. Both axes are in units of their respective
data sets. If the two sets come from a population with the same distribution, the points should
fall approximately along a 45-degree reference line. From these QQ plots, we observe that the
empirical distribution of error does not deviate much from a normal distribution under a variety
of conditions when samples are relatively large (n > 500).
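
A QQ comparison of this kind can be sketched as follows. One difference from the description above is deliberate and should be kept in mind: instead of quantiles estimated from $n$ simulated normal observations, the sketch uses theoretical quantiles of $N(0, \hat{\sigma}_e)$, where $\hat{\sigma}_e$ is the sample standard deviation of the errors. The function name and the plotting positions $(k - 0.5)/n$ are hypothetical choices.

```python
import numpy as np
from statistics import NormalDist


def qq_points(errors):
    """Return (normal quantiles, sorted errors) for a QQ plot against N(0, sd)."""
    e = np.sort(np.asarray(errors, dtype=float))
    n = e.size
    probs = (np.arange(1, n + 1) - 0.5) / n   # plotting positions
    normal = NormalDist(mu=0.0, sigma=e.std(ddof=1))
    theoretical = np.array([normal.inv_cdf(p) for p in probs])
    return theoretical, e
```

Plotting the second array against the first and adding the 45-degree line gives a display comparable to the QQ plots described above, with `errors` taken as the differences between observed and true scores in a simulated data set.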



Table A2

Ratio of Variance and Mean-Square for $\sigma_e(\theta)$

                   J = 20                    J = 40                    J = 80
n \ $\sigma_\theta$      .1      1      5        .1      1      5        .1      1      5
  100     .0023  .1380  .2701     .0023  .1034  .2384     .0028  .1642  .3560
  500     .0026  .1692  .3958     .0026  .1237  .2522     .0024  .1163  .3202
1,000     .0026  .1913  .4312     .0027  .1531  .3144     .0025  .1235  .3318
5,000     .0026  .1226  .2620     .0024  .1348  .2388     .0024  .1184  .3195

Note. J = test length.



From IRT simulations, the measurement error can be approximated by a normal distribution
for a moderately long test (J > 40) and a medium-sized population (n > 500). Even for shorter
tests with smaller sample sizes, the normal approximation is sometimes still acceptable. However,
whether real data have such a property is unknown since the true scores are unobservable.
Figure A1. Error distribution comparison with a normal distribution for a test of
length 80. In each plot, si is the standard deviation of the ability, N is the sample size
in the simulation, and SEM.em is the empirical SEM.

Figure A2. Error distribution comparison with a normal distribution for a test of
length 80. In each plot, si is the standard deviation of the ability, N is the sample size
in the simulation, and SEM.em is the empirical SEM.

Figure A3. Error distribution comparison with a normal distribution for a test of
length 80. In each plot, si is the standard deviation of the ability, N is the sample size
in the simulation, and SEM.em is the empirical SEM.